Shyue-Ming TANG Yue-Li WANG Chien-Yi LI Jou-Ming CHANG
Generalized recursive circulant graphs (GRCGs for short) are a generalization of recursive circulant graphs and provide a new type of topology for interconnection networks. A graph of n vertices is said to be s-pancyclic for some $3leqslant sleqslant n$ if it contains cycles of every length t for $sleqslant tleqslant n$. The pancyclicity of recursive circulant graphs was investigated by Araki and Shibata (Inf. Process. Lett. vol.81, no.4, pp.187-190, 2002). In this paper, we are concerned with the s-pancyclicity of GRCGs.
A vertex subset F ⊆ V(G) is called a cyclic vertex-cut set of a connected graph G if G-F is disconnected such that at least two components in G-F contain cycles. The cyclic vertex connectivity is the cardinality of a minimum cyclic vertex-cut set. In this paper, we show that the cyclic vertex connectivity of the trivalent Cayley graphs TGn is equal to eight for n ≥ 4.
Takashi YOKOTA Kanemitsu OOTSU Takeshi OHKAWA
Interconnection network is one of the inevitable components in parallel computers, since it is responsible to communication capabilities of the systems. It affects the system-level performance as well as the physical and logical structure of the systems. Although many studies are reported to enhance the interconnection network technology, we have to discuss many issues remaining. One of the most important issues is congestion management. In an interconnection network, many packets are transferred simultaneously and the packets interfere to each other in the network. Congestion arises as a result of the interferences. Its fast spreading speed seriously degrades communication performance and it continues for long time. Thus, we should appropriately control the network to suppress the congested situation for maintaining the maximum performance. Many studies address the problem and present effective methods, however, the maximal performance in an ideal situation is not sufficiently clarified. Solving the ideal performance is, in general, an NP-hard problem. This paper introduces particle swarm optimization (PSO) methodology to overcome the problem. In this paper, we first formalize the optimization problem suitable for the PSO method and present a simple PSO application as naive models. Then, we discuss reduction of the size of search space and introduce three practical variations of the PSO computation models as repetitive model, expansion model, and coding model. We furthermore introduce some non-PSO methods for comparison. Our evaluation results reveal high potentials of the PSO method. The repetitive and expansion models achieve significant acceleration of collective communication performance at most 1.72 times faster than that in the bursty communication condition.
The exchanged hypercube, denoted by EH(s,t), is a graph obtained by systematically removing edges from the corresponding hypercube, while preserving many of the hypercube's attractive properties. Moreover, ring-connected topology is one of the most promising topologies in Wavelength Division Multiplexing (WDM) optical networks. Let Rn denote a ring-connected topology. In this paper, we address the routing and wavelength assignment problem for implementing the EH(s,t) communication pattern on Rn, where n=s+t+1. We design an embedding scheme. Based on the embedding scheme, a near-optimal wavelength assignment algorithm using 2s+t-2+⌊2t/3⌋ wavelengths is proposed. We also show that the wavelength assignment algorithm uses no more than an additional 25 percent of (or ⌊2t-1/3⌋) wavelengths, compared to the optimal wavelength assignment algorithm.
Ryuta KAWANO Hiroshi NAKAHARA Ikki FUJIWARA Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO
End-to-end network latency has become an important issue for parallel application on large-scale high performance computing (HPC) systems. It has been reported that randomly-connected inter-switch networks can lower the end-to-end network latency. This latency reduction is established in exchange for a large amount of routing information. That is, minimal routing on irregular networks is achieved by using routing tables for all destinations in the networks. In this work, a novel distributed routing method called LOREN (Layout-Oriented Routing with Entries for Neighbors) to achieve low-latency with a small routing table is proposed for irregular networks whose link length is limited. The routing tables contain both physically and topologically nearby neighbor nodes to ensure livelock-freedom and a small number of hops between nodes. Experimental results show that LOREN reduces the average latencies by 5.8% and improves the network throughput by up to 62% compared with a conventional compact routing method. Moreover, the number of required routing table entries is reduced by up to 91%, which improves scalability and flexibility for implementation.
Naohisa FUKASE Yasuyuki MIURA Shigeyoshi WATANABE M.M. HAFIZUR RAHMAN
The high performance network-on-chip (NoC) router using minimal hardware resources to minimize the layout area is very essential for NoC design. In this paper, we have proposed a memory sharing method of a wormhole routed NoC architecture to alleviate the area overhead of a NoC router. In the proposed method, a memory is shared by multiple physical links by using a multi-port memory. In this paper, we have proposed a partial link-sharing method and evaluated the communication performance using the proposed method. It is revealed that the resulted communication performance by the proposed methods is higher than that of the conventional method, and the progress ratio of the 3D-torus network is higher than that of 2D-torus network. It is shown that the improvement of communication performance using partial link sharing method is achieved with slightly increase of hardware cost.
Junsuk PARK Nobuhiro SEKI Keiichi KANEKO
In the topologies for interconnected nodes, it is desirable to have a low degree and a small diameter. For the same number of nodes, a dual-cube topology has almost half the degree compared to a hypercube while increasing the diameter by just one. Hence, it is a promising topology for interconnection networks of massively parallel systems. We propose here a stochastic fault-tolerant routing algorithm to find a non-faulty path from a source node to a destination node in a dual-cube.
The Möbius cube is a variant of the hypercube. Its advantage is that it can connect the same number of nodes as a hypercube but with almost half the diameter of the hypercube. We propose an algorithm to solve the node-to-node disjoint paths problem in n-Möbius cubes in polynomial-order time of n. We provide a proof of correctness of the algorithm and estimate that the time complexity is O(n2) and the maximum path length is 3n-5.
Hiroshi NAKAHARA Tomoya OZAKI Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO
The increase of recent non-recurrent engineering cost (design, mask and test cost) have made large System-on-Chip (SoC) difficult to develop especially with advanced technology. We radically explore an approach for cheap and flexible chip stacking by using Inductive coupling ThruChip Interface (TCI). In order to connect a large number of small chips for building a large scale system, novel chip stacking methods called the linear stacking and staggered stacking are proposed. They enable the system to be extended to x or/and y dimensions, not only to z dimension. Here, a novel chip staking layout, and its deadlock-free routing design for the case using single-core chips and multi-core chips are shown. The network with 256 nodes formed by the proposed stacking improves the latency of 2D mesh by 13.8% and the performance of NAS Parallel Benchmarks by 5.4% on average compared to that of 2D mesh.
Takashi YOKOTA Kanemitsu OOTSU Takeshi OHKAWA
State-of-the-art parallel computers, which are growing in parallelism, require a lot of things in their interconnection networks. Although wide spectrum of efforts in research and development for effective and practical interconnection networks are reported, the problem is still open. One of the largest issues is congestion control that intends to maximize the network performance in terms of throughput and latency. Throttling, or injection limitation, is one of the center ideas of congestion control. We have proposed a new class of throttling method, Entropy Throttling, whose foundation is entropy concept of packets. The throttling method is successful in part, however, its potentials are not sufficiently discussed. This paper aims at exploiting capabilities of the Entropy Throttling method via comprehensive evaluation. Major contributions of this paper are to introduce two ideas of hysteresis function and guard time and also to clarify wide performance characteristics in steady and unsteady communication situations. By introducing the new ideas, we extend the Entropy throttling method. The extended methods improve communication performance at most 3.17 times in the best case and 1.47 times in average compared with non-throttling cases in collective communication, while the method can sustain steady communication performance.
Yao HU Ikki FUJIWARA Michihiro KOIBUCHI
A number of parallel applications run on a high-performance computing (HPC) system simultaneously. Job mapping and scheduling become crucial to improve system utilization, because fragmentation prevents an incoming job from being assigned even if there are enough compute nodes unused. Wireless supercomputers and datacenters with free-space optical (FSO) terminals have been proposed to replace the conventional wired interconnection so that a diverse application workload can be better supported by changing their network topologies. In this study we firstly present an efficient job mapping by swapping the endpoints of FSO links in a wireless HPC system. Our evaluation shows that an FSO-equipped wireless HPC system can achieve shorter average queuing length and queuing time for all the dispatched user jobs. Secondly, we consider the use of a more complicated and enhanced scheduling algorithm, which can further improve the system utilization over different host networks, as well as the average response time for all the dispatched user jobs. Finally, we present the performance advantages of the proposed wireless HPC system under more practical assumptions such as different cabinet capacities and diverse subtopology packings.
Thao-Nguyen TRUONG Khanh-Van NGUYEN Ikki FUJIWARA Michihiro KOIBUCHI
System expandability becomes a major concern for highly parallel computers and data centers, because their number of nodes gradually increases year by year. In this context we propose a low-degree topology and its floor layout in which a cabinet or node set can be newly inserted by connecting short cables to a single existing cabinet. Our graph analysis shows that the proposed topology has low diameter, low average shortest path length and short average cable length comparable to existing topologies with the same degree. When incrementally adding nodes and cabinets to the proposed topology, its diameter and average shortest path length increase modestly. Our discrete-event simulation results show that the proposed topology provides a comparable performance to 2-D Torus for some parallel applications. The network cost and power consumption of DSN-F modestly increase when compared to the counterpart non-random topologies.
David KOCIK Yuki HIRAI Keiichi KANEKO
This paper proposes an algorithm that solves the node-to-set disjoint paths problem in an n-Möbius cube in polynomial-order time of n. It also gives a proof of correctness of the algorithm as well as estimating the time complexity, O(n4), and the maximum path length, 2n-1. A computer experiment is conducted for n=1,2,...,31 to measure the average performance of the algorithm. The results show that the average time complexity is gradually approaching to O(n3) and that the maximum path lengths cannot be attained easily over the range of n in the experiment.
Hon-Chan CHEN Tzu-Liang KUNG Yun-Hao ZOU Hsin-Wei MAO
In this paper, we investigate the fault-tolerant Hamiltonian problems of crossed cubes with a faulty path. More precisely, let P denote any path in an n-dimensional crossed cube CQn for n ≥ 5, and let V(P) be the vertex set of P. We show that CQn-V(P) is Hamiltonian if |V(P)|≤n and is Hamiltonian connected if |V(P)| ≤ n-1. Compared with the previous results showing that the crossed cube is (n-2)-fault-tolerant Hamiltonian and (n-3)-fault-tolerant Hamiltonian connected for arbitrary faults, the contribution of this paper indicates that the crossed cube can tolerate more faulty vertices if these vertices happen to form some specific types of structures.
Ahmed SHALABY Ikki FUJIWARA Michihiro KOIBUCHI
Recently network bandwidth becomes a performance concern particularly for collective communication since bisection bandwidths of supercomputers become far less than their full bisection bandwidths. In this context we propose the use of a network coding technique to reduce the number of unicasts and the size of data transferred in latency-sensitive collective communications in supercomputers. Our proposed network coding scheme has a hierarchical multicasting structure with intra-group and inter-group unicasts. Quantitative analysis show that the aggregate path hop counts by our hierarchical network coding decrease as much as 94% when compared to conventional unicast-based multicasts. We validate these results by cycle-accurate network simulations. In 1,024-switch networks, the network reduces the execution time of collective communications as much as 70%. We also show that our hierarchical network coding is beneficial for any packet size.
Kung-Jui PAI Jinn-Shyong YANG Sing-Chen YAO Shyue-Ming TANG Jou-Ming CHANG
Let T1,T2,...,Tk be spanning trees in a graph G. If, for any two vertices u,v of G, the paths joining u and v on the k trees are mutually vertex-disjoint, then T1,T2,...,Tk are called completely independent spanning trees (CISTs for short) of G. The construction of CISTs can be applied in fault-tolerant broadcasting and secure message distribution on interconnection networks. Hasunuma (2001) first introduced the concept of CISTs and conjectured that there are k CISTs in any 2k-connected graph. Unfortunately, this conjecture was disproved by Péterfalvi recently. In this note, we give a necessary condition for k-connected k-regular graphs with ⌊k/2⌋ CISTs. Based on this condition, we provide more counterexamples for Hasunuma's conjecture. By contrast, we show that there are two CISTs in 4-regular chordal rings CR(N,d) with N=k(d-1)+j under the condition that k ≥ 4 is even and 0 ≤ j ≤ 4. In particular, the diameter of each constructed CIST is derived.
Wen-Yin HUANG Jia-Jie LIU Jou-Ming CHANG Ro-Yu WU
An n-dimensional folded hypercube, denoted by FQn, is an enhanced n-dimensional hypercube with one extra link between nodes that have the furthest Hamming distance. Let FFv (respectively, FFe) denote the set of faulty nodes (respectively, faulty links) in FQn. Under the assumption that every fault-free node in FQn is incident to at least two fault-free links, Hsieh et al. (Inform. Process. Lett. 110 (2009) pp.41-53) showed that if |FFv|+|FFe| ≤ 2n-4 for n ≥ 3, then FQn-FFv-FFe contains a fault-free cycle of length at least 2n-2|FFv|. In this paper, we show that, under the same conditional fault model, FQn with n ≥ 5 can tolerate more faulty elements and provides the same lower bound of the length of a longest fault-free cycle, i.e., FQn-FFv-FFe contains a fault-free cycle of length at least 2n-2|FFv| if |FFv|+|FFe| ≤ 2n-3 for n ≥ 5.
Shi-Ze GUO Zhe-Ming LU Guang-Yu KANG Zhe CHEN Hao LUO
Small-world is a common property existing in many real-life social, technological and biological networks. Small-world networks distinguish themselves from others by their high clustering coefficient and short average path length. In the past dozen years, many probabilistic small-world networks and some deterministic small-world networks have been proposed utilizing various mechanisms. In this Letter, we propose a new deterministic small-world network model by first constructing a binary-tree structure and then adding links between each pair of brother nodes and links between each grandfather node and its four grandson nodes. Furthermore, we give the analytic solutions to several topological characteristics, which shows that the proposed model is a small-world network.
Shouyi YIN Yang HU Zhen ZHANG Leibo LIU Shaojun WEI
Hybrid wired/wireless on-chip network is a promising communication architecture for multi-/many-core SoC. For application-specific SoC design, it is important to design a dedicated on-chip network architecture according to the application-specific nature. In this paper, we propose a heuristic wireless link allocation algorithm for creating hybrid on-chip network architecture. The algorithm can eliminate the performance bottleneck by replacing multi-hop wired paths by high-bandwidth single-hop long-range wireless links. The simulation results show that the hybrid on-chip network designed by our algorithm improves the performance in terms of both communication delay and energy consumption significantly.
The on-chip interconnection network (OCIN) is an integrated solution for system-on-chip (SoC) designs. The buffer architecture and size, however, dominate the performance of OCINs and affect the design of routers. This work analyzes different buffer architectures and uses a data-link two-level FIFO (first-in first-out) buffer architecture to implement high-performance routers. The concepts of shared buffers and multiple accesses for buffers are developed using the two-level FIFO buffer architecture. The proposed two-level FIFO buffer architecture increases the utilities of the storage elements via the centralized buffer organization and reduces the area and power consumption of routers to achieve the same performance achieved by other buffer architectures. Depending on a cycle-accurate simulator, the proposed data-link two-level FIFO buffer can realize performance similar to that of the conventional virtual channels, while using 25% of the buffers. Consequently, the two-level FIFO buffer can achieve about 22% power reduction compared with the similar performance of the conventional virtual channels using UMC 65 nm CMOS technology.